I’m a PhD candidate at the University of Eastern Finland (Doctoral Programme of Clinical Research) doing pharmacogenomic research. We use population-level health registry and genomic data in our research, and I’m expecting to learn new tools and tips for coding and statistical analysis of big data during this course. A couple of other PhD students recommended this course as useful.
My GitHub repository can be found here: https://github.com/katjahakki/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Sun Dec 5 19:15:17 2021"
The original data file contains JYTOPKYS2 survey data concerning learning and teaching in a statistics course; a reference to the data source can be found here: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt. A new analysis dataset, learning2014, was created with the variables gender, age, attitude, deep, stra, surf and points by combining questions from the original data file. All combination variables were scaled back to the original scale by taking the mean, and observations with zero exam points were excluded.
The learning2014 dataset is a dataframe with 166 observations and 7 variables. The observations represent students in the statistical course (rows) and each column contains values of one variable.
Meaning of the variable abbreviations:
attitude = attitude towards statistics
deep = deep learning
stra = strategic learning
surf = surface learning
points = total points of the course exam
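The combination-variable construction described above can be sketched as follows. This is a minimal illustration, not the actual wrangling script: the component question names and values here are placeholders (the real ones are listed in the JYTOPKYS2 metadata).

```r
# sketch of the wrangling: question names D03, D11, D19 are illustrative
lrn <- data.frame(D03 = c(4, 3), D11 = c(3, 5), D19 = c(5, 4),
                  points = c(20, 0))
deep_questions <- c("D03", "D11", "D19")
# scale the combination variable back to the original 1-5 scale
# by taking the row mean of its component questions
lrn$deep <- rowMeans(lrn[, deep_questions])
# exclude observations with zero exam points
lrn <- lrn[lrn$points > 0, ]
```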
## RStudio Exercise 2: Analysis
date()
## [1] "Sun Dec 5 19:15:17 2021"
# read in data, learning2014.csv file, using read.csv()
learning2014 <- read.csv('~/IODS-project/data/learning2014.csv')
# show first 6 observations of the data
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
# data structure
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# data dimensions, rows and columns
dim(learning2014)
## [1] 166 7
# number of rows in the data
nrow(learning2014)
## [1] 166
# number of columns in the data
ncol(learning2014)
## [1] 7
The summary() function is a generic R function that produces basic statistics of the data. It is also used to produce result summaries from various model fitting functions.
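As a quick illustration of this genericness, summary() dispatches on the class of its argument: a numeric vector gives the six-number summary, while a model object gives a model summary (the built-in cars dataset is used here for self-containment).

```r
# basic statistics of a numeric vector
summary(1:10)
# result summary of a fitted model object
fit <- lm(dist ~ speed, data = cars)
summary(fit)
```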
# summaries, basic statistics of the variables
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# distribution of female and male students
summary(as.factor(learning2014$gender))
## F M
## 110 56
The majority of the students are female. With the summary function we can see the minimum, first quartile, median, third quartile and maximum values of the numeric variables.
Let’s draw a scatter plot matrix of the data.
# change variable gender from character to factor in order to draw a scatter plot
learning2014$gender = as.factor(learning2014$gender)
# draw a scatter plot matrix of the variables in learning2014
# [-1] excludes the first column (gender)
pairs(learning2014[-1], col=learning2014$gender)
Scatter plot matrices are a way to roughly assess whether there are linear relationships between multiple variables. The variables run along the diagonal from top left to bottom right, and each variable is plotted against every other. The panels in the upper right of the matrix are mirror images of those in the lower left.
Let’s create a more advanced plot matrix of the data.
# access the GGally and ggplot2 libraries
#install.packages("GGally")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create plot matrix with ggpairs()
# map colour to gender and add transparency (alpha = 0.3) inside aes()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The plot matrix can be used to visualize the relationships of several variables in a single figure. Scatter plots of each pair of numeric variables are drawn in the lower-left part of the plot, Pearson correlations are shown in the upper right, and variable distributions are displayed on the diagonal. This plot shows all data separately for female (red) and male (green) students. The top row shows boxplots and the leftmost column histograms of the variables. The boxplots display the distribution of the data based on a five-number summary: minimum, first quartile, median, third quartile and maximum. Outliers are shown as points outside the boxes.
The strongest correlation is between the points and attitude variables. Looking at the diagonal distributions, the age variable is right-skewed: the students are mainly young. The bar plot in the top-left corner shows the gender distribution; female students are the majority group.
Let’s draw a scatter plot of student’s attitude versus exam points with a regression line.
# Access the ggplot2 library
library(ggplot2)
# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender))
# define the visualization type (points)
p2 <- p1 + geom_point()
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# add a main title and draw the plot
p4 <- p3 + ggtitle("Student's attitude versus exam points")
# draw the plot
p4
## `geom_smooth()` using formula 'y ~ x'
The plot and the regression line indicate a linear relationship between student’s attitude and exam points for both female and male students.
Simple linear regression models the relationship between the magnitude of one variable and that of a second: for example, as X increases, Y also increases, or as X increases, Y decreases. Regression quantifies the nature of the relationship between the variables and estimates how much Y will change when X changes by a certain amount. With regression, we predict the Y variable from X using a linear relationship.
When there is more than one explanatory variable in the linear model, it is called multiple regression. The response variable is assumed to be normally distributed with a mean that is a linear function of the explanatory variables and a variance that does not depend on them.
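In formula terms, the multiple regression model with k explanatory variables is

$$ y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), $$

where the errors are assumed independent of the explanatory variables.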
Let’s fit a multiple linear regression model where exam points is the target variable (Y) and age, attitude and stra are the explanatory variables (X variables) and then print out the summary of the fitted model.
# fit a regression model, exam points is the target (dependent) variable
my_model <- lm(points ~ age + attitude + stra, data = learning2014)
# summary of the fitted model
summary(my_model)
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The question is: do the age, attitude and stra variables explain the exam points?
We can see from the summary that the attitude variable has a statistically significant p-value, meaning that attitude has a statistically significant relationship with the target variable points. The age and stra variables are not statistically significant (p-values > 0.05), so they do not show a statistically significant relationship with points.
Let’s remove the age and stra variables and fit the model again without them, with points as the target variable and attitude as the explanatory variable.
# fit a regression model with statistically significant explanatory variable
my_model2 <- lm(points ~ attitude, data = learning2014)
Let’s print the summary of the simple regression model.
# summary of the fitted my_model2
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In the summary statistics of the regression model, Call shows the function and parameters used to create the model. Residuals show the differences between the model predictions and the actual values of y. The coefficient table contains two rows, the intercept and the slope. The intercept is the expected mean value of Y when X = 0. The coefficient standard error measures the average amount that the coefficient estimate varies from its true value. The coefficient t-value measures how many standard errors the coefficient estimate is away from 0, and Pr(>|t|) is the probability of observing a t-value at least as extreme by chance.
A small p-value indicates that it is unlikely to observe such a relationship between the explanatory and target variables by chance. Three stars under ‘Signif. codes’ represent a highly significant p-value (< 0.001).
We can see an estimate of 3.5 for the attitude variable, which means that if the attitude towards statistics score increases by one, the student’s expected exam points increase by 3.5.
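This slope interpretation can be checked with predict(): in a simple linear regression, the difference in predicted points for a one-unit change in attitude equals the slope estimate exactly. Toy data are generated here so the chunk is self-contained; my_model2 above could be used the same way.

```r
# toy data with a known linear relationship plus noise
set.seed(1)
toy <- data.frame(attitude = runif(50, 1, 5))
toy$points <- 11.6 + 3.5 * toy$attitude + rnorm(50, sd = 5)
fit <- lm(points ~ attitude, data = toy)
# predicted points at attitude 3 and 4
pred <- predict(fit, newdata = data.frame(attitude = c(3, 4)))
# the difference equals the fitted slope, coef(fit)["attitude"]
unname(diff(pred))
```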
The residual standard error is a measure of the quality of a linear regression fit, and the F-statistic indicates whether there is a relationship between the predictor and the response variables. Multiple R-squared ranges from 0 to 1 and measures the proportion of variation in the data that is accounted for by the model; it is used to assess how well the model fits the data. In this case, 19.06% of the variation in exam points is explained by the variation in attitude towards statistics.
Let’s draw diagnostic plots for the regression model.
# Produce the following diagnostic plots: Residuals vs Fitted values,
# Normal QQ-plot and Residuals vs Leverage
par(mfrow = c(2,2))
plot(my_model2, which = c(1, 2, 5))
A key assumption of a linear regression model is linearity: the target variable is modeled as a linear combination of the model parameters. The errors are usually assumed to be normally distributed, uncorrelated, and to have a constant variance that does not depend on the explanatory variables. By analyzing the residuals of the model we can explore the validity of these assumptions.
Residuals vs Fitted values
The constant variance assumption implies that the size of the errors should not depend on the explanatory variables, which can be checked with a scatter plot of residuals versus fitted values. A residual measures how well the line fits an individual data point: the closer a residual is to 0, the better the fit. Linearity seems to hold here, as the red line stays close to the dashed line. Potentially problematic cases are the observations on rows 35, 56 and 145.
Normal QQ-plot, normality of the errors
A QQ-plot of the residuals is a way to explore the assumption that the errors of the model are normally distributed. In the plot the data points approximately follow the dashed line, supporting the assumption that the residuals are normally distributed. Potentially problematic cases are the observations on rows 35, 56 and 145.
Residuals vs Leverage, leverage of observations
Leverage measures how much impact a single observation has on the model, and the residuals vs leverage plot can be used to identify observations with an unusually high impact. Most of the cases are well inside the Cook’s distance lines and the red line follows the dashed line quite well, so no single observation seems to have a major impact on the model. Data points outside the Cook’s distance lines may be influential to the regression results. Potentially problematic cases are the observations on rows 35, 56 and 71.
The data come from two identical questionnaires related to secondary school student alcohol consumption in Portugal. The data source can be found here: UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/dataset). Metadata is available at: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
A new analysis dataset, pormath, was created by joining the math course and Portuguese language course datasets. The two datasets were joined using all variables other than “failures”, “paid”, “absences”, “G1”, “G2” and “G3” as (student) identifiers.
## RStudio Exercise 3: Analysis
date()
## [1] "Sun Dec 5 19:15:26 2021"
# read in the joined student alcohol consumption data pormath, using read.csv()
pormath <- read.csv('~/IODS-project/data/pormath.csv')
# introduction of the dataset pormath, structure of the data
str(pormath)
## 'data.frame': 370 obs. of 51 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "no" "no" ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : chr "yes" "no" "no" "no" ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : chr "yes" "no" "yes" "yes" ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
# print out names of the variables
colnames(pormath)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
# rows and columns in the dataset
dim(pormath)
## [1] 370 51
The pormath dataset is a dataframe with 370 observations (rows) and 51 variables (columns). The observations represent students and each column contains values of one variable.
Let’s choose 4 variables and create hypotheses for logistic regression.
The variables:
freetime: free time after school (numeric: from 1 - very low to 5 - very high)
absences: number of school absences (numeric: from 0 to 93)
age: student’s age (numeric: from 15 to 22)
goout: going out with friends (numeric: from 1 - very low to 5 - very high)
The hypotheses:
1: students with more free time have a higher probability of high alcohol consumption than students with less free time
2: students with more absences have a higher probability of high alcohol consumption than students with fewer absences
3: older students have a higher probability of high alcohol consumption than younger students
4: students who go out more with friends have a higher probability of high alcohol consumption than students who go out less
A numerical overview of the data can be obtained, for example, by exploring the frequencies of the variables in cross tables and the basic summary statistics of the variables using the table() and summary() R functions.
## frequencies of the variables, cross tables
# low and high alcohol consumption
summary(as.factor(pormath$high_use))
## FALSE TRUE
## 259 111
# alcohol consumption and freetime
table(high_use = pormath$high_use, freetime = pormath$freetime)
## freetime
## high_use 1 2 3 4 5
## FALSE 15 46 112 65 21
## TRUE 2 14 40 40 15
# alcohol consumption and absences
table(high_use = pormath$high_use, absences = pormath$absences)
## absences
## high_use 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 19 20 21 26 27
## FALSE 50 37 40 31 24 16 16 9 14 5 5 2 3 1 1 0 0 1 0 2 1 0 0
## TRUE 13 13 16 7 11 6 5 3 6 6 2 3 4 1 6 1 1 1 1 0 1 1 1
## absences
## high_use 29 44 45
## FALSE 0 0 1
## TRUE 1 1 0
# alcohol consumption and age
table(high_use = pormath$high_use, age = pormath$age)
## age
## high_use 15 16 17 18 19 20 22
## FALSE 63 74 62 52 7 1 0
## TRUE 18 28 35 25 4 0 1
# alcohol consumption and going out with friends
table(high_use = pormath$high_use, goout = pormath$goout)
## goout
## high_use 1 2 3 4 5
## FALSE 19 82 97 40 21
## TRUE 3 15 23 38 32
In these two-dimensional frequency cross tables we can see the questionnaire results as frequencies over the entire group of students for the two chosen variables in each table. We can see that 111 out of 370 (30%) students report high use of alcohol. A student’s ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
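The high_use definition can be sketched as follows. That alc_use is the average of workday (Dalc) and weekend (Walc) consumption is an assumption about the data wrangling step, which is not shown in this report.

```r
# sketch of the high_use definition on toy rows; the alc_use formula
# (average of Dalc and Walc) is assumed from the wrangling step
alc <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2
alc$high_use <- alc$alc_use > 2
alc
```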
# select variables freetime, absences, age, goout for joined summary statistics
myvars <- c("freetime", "absences", "age", "goout")
new_pormath <- pormath[myvars]
# print basic statistic summary
summary(new_pormath)
## freetime absences age goout
## Min. :1.000 Min. : 0.000 Min. :15.00 Min. :1.000
## 1st Qu.:3.000 1st Qu.: 1.000 1st Qu.:16.00 1st Qu.:2.000
## Median :3.000 Median : 3.000 Median :17.00 Median :3.000
## Mean :3.224 Mean : 4.511 Mean :16.58 Mean :3.116
## 3rd Qu.:4.000 3rd Qu.: 6.000 3rd Qu.:17.00 3rd Qu.:4.000
## Max. :5.000 Max. :45.000 Max. :22.00 Max. :5.000
# access library psych
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# create summary table
describe(pormath[ , c('freetime', 'absences', 'age', 'goout')], fast = TRUE)
## vars n mean sd min max range se
## freetime 1 370 3.22 0.99 1 5 4 0.05
## absences 2 370 4.51 5.50 0 45 45 0.29
## age 3 370 16.58 1.18 15 22 7 0.06
## goout 4 370 3.12 1.13 1 5 4 0.06
We can see the minimum, first quartile, median, third quartile and maximum values of the freetime, absences, age and goout variables with the summary function. The summary table created with the describe() function also outputs the column number (vars), the number of valid cases (n), the standard deviation (sd) and the standard error (se) for each variable.
Let’s draw bar plots for each variable.
## bar plots to describe distributions of the variables
# access the libraries tidyr, dplyr, ggplot2 and cowplot
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cowplot)
# bar plots for each variable
g1 <- ggplot(pormath, aes(x = high_use, fill = sex)) + geom_bar()
g2 <- ggplot(pormath, aes(x = freetime, fill = sex)) + geom_bar()
g3 <- ggplot(pormath, aes(x = absences, fill = sex)) + geom_bar()
g4 <- ggplot(pormath, aes(x = age, fill = sex)) + geom_bar()
g5 <- ggplot(pormath, aes(x = goout, fill = sex)) + geom_bar()
# draw all bar plots
plot_grid(g1, g2, g3, g4, g5, labels="AUTO")
The bar plots show counts for each variable by gender. Approximately a third of the students report high alcohol consumption, and of those the majority are male. Most students have an average (3) or high (4) amount of free time after school and roughly 0 to 14 absences. The majority of the students are between 15 and 18 years of age, and the gender distribution is roughly even. Going out with friends varies mainly from low (2) to very high (5).
Let’s draw box plots to see the relationships of the variables with alcohol consumption.
## box plots to describe distributions of the variables
# box plots of the variables
g6 <- ggplot(pormath, aes(x = high_use, y = freetime, col = sex)) + geom_boxplot() + ylab("freetime")
g7 <- ggplot(pormath, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("absences")
g8 <- ggplot(pormath, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age")
g9 <- ggplot(pormath, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("goout")
# draw boxplots
plot_grid(g6, g7, g8, g9, labels = "AUTO")
In the box plots the median marks the mid-point of the data and is shown by the line that divides the box into two parts: half of the data points are greater than or equal to this value and half are less. The box itself represents the middle 50% of the data points for the analysis group. The whiskers are the lines extending from the box to the most extreme values within 1.5 times the interquartile range; points beyond the whiskers are drawn as outliers. Box plots thus display the distribution of the data: whether it is symmetrical, how tightly it is grouped, and whether and how it is skewed.
We can see from the box plots that:
High alcohol consumption and freetime
- females with low alcohol consumption show greater variability in free time, and have less free time, than males with low or high alcohol consumption or females with high alcohol consumption. Roughly it seems that the amount of free time after school does not have much influence on alcohol consumption.
High alcohol consumption and absences
- the number of absences is higher for both females and males who report high alcohol consumption.
High alcohol consumption and age
- males with high alcohol use are a bit older than males with low alcohol consumption.
High alcohol consumption and going out with friends
- both female and male students who go out more with friends have higher alcohol consumption.
Based on these box plots, hypothesis 2 (students with more absences consume more alcohol than students with fewer absences) and hypothesis 4 (students who go out more with friends consume more alcohol than students who go out less) could hold. Hypothesis 3 (older students consume more alcohol than younger students) might not hold for the whole group of students.
Logistic regression is a generalized linear model (GLM). It can be used to assess the effects of a set of explanatory variables on a binary response variable. The response in the logistic regression formula is the log odds of the binary outcome being 1.
Let’s fit the logistic regression model. The binary high/low alcohol consumption variable is the target variable. In R, a logistic regression is fitted with the glm() function with the family argument set to binomial.
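In formula terms, the model fitted below is

$$ \log \frac{p}{1-p} = \beta_0 + \beta_1 \,\text{freetime} + \beta_2 \,\text{absences} + \beta_3 \,\text{age} + \beta_4 \,\text{goout}, $$

where $p = P(\text{high\_use} = \text{TRUE})$ and $p/(1-p)$ is the odds of high alcohol consumption.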
# create logistic regression model
model <- glm(high_use ~ freetime + absences + age + goout, data = pormath, family = "binomial")
# print summary of the model
summary(model)
##
## Call:
## glm(formula = high_use ~ freetime + absences + age + goout, family = "binomial",
## data = pormath)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7891 -0.7617 -0.5500 0.9668 2.3609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.18857 1.85646 -2.795 0.005192 **
## freetime 0.18307 0.13719 1.334 0.182071
## absences 0.07660 0.02239 3.421 0.000625 ***
## age 0.06957 0.10882 0.639 0.522605
## goout 0.67244 0.12381 5.431 5.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.17 on 365 degrees of freedom
## AIC: 398.17
##
## Number of Fisher Scoring iterations: 4
In the summary of the logistic regression model, Call shows the function and parameters used to create the model. Deviance is a measure of the goodness of fit of a generalized linear model. The null deviance shows how well the target variable is predicted by a model that includes only the intercept, while the residual deviance shows how well it is predicted by the model with the predictor variables.
High alcohol consumption and absences
Each unit increase in absences increases the log odds of high alcohol consumption by 0.0766, and the p-value indicates that absences is statistically significant in determining high alcohol consumption.
High alcohol consumption and going out with friends
Each unit increase in going out with friends increases the log odds of high alcohol consumption by 0.672, and the p-value indicates that goout is statistically significant in determining high alcohol consumption.
The freetime and age variables are not statistically significant, meaning that they do not show a statistically significant relationship with the binary high/low alcohol consumption target.
The estimated parameters of a logistic regression model can be interpreted in terms of odds and odds ratios: the exponents of the coefficients can be interpreted as the odds ratios associated with a unit change (versus no change) in the corresponding explanatory variable. Let’s print the coefficients of the model as odds ratios together with their confidence intervals.
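This exp(coefficient) interpretation can be verified directly: for a fitted logistic model, the ratio of the odds at x + 1 to the odds at x equals exp of the coefficient exactly. Toy data are used here so the chunk is self-contained.

```r
# simulate a binary outcome whose log odds are linear in x
set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 0.8 * x))
fit <- glm(y ~ x, family = "binomial")
# odds of y = 1 at a given value of x
odds_at <- function(xv) {
  p <- predict(fit, newdata = data.frame(x = xv), type = "response")
  p / (1 - p)
}
# the odds ratio for a one-unit change equals exp(coef(fit)["x"])
unname(odds_at(1) / odds_at(0))
```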
# model with statistically significant variables
model2 <- glm(high_use ~ absences + goout, data = pormath, family = "binomial")
# print the coefficients of the model
coef(model2)
## (Intercept) absences goout
## -3.63280208 0.07631808 0.73380079
# compute odds ratios (OR)
OR <- coef(model2) %>% exp
# compute confidence intervals (CI)
CI <- confint(model2) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their 95% confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02644199 0.01082308 0.06045431
## absences 1.07930582 1.03477610 1.13034717
## goout 2.08298255 1.66258410 2.64170298
The odds ratio (OR) 1.08 for absences, with a 95% confidence interval of 1.03 to 1.13, means that each additional absence is associated with a 3.5% to 13% increase in the odds of high alcohol consumption.
The odds ratio (OR) 2.08 for the goout variable, with a 95% confidence interval of 1.66 to 2.64, means that each one-unit increase in going out with friends is associated with roughly 1.7 to 2.6 times the odds of high alcohol consumption.
Let’s explore the predictive power of the logistic regression model to know how well the model actually predicts the target value.
# re-fit the model with the statistically significant explanatory variables
model2 <- glm(high_use ~ absences + goout, data = pormath, family = "binomial")
# predict the probability of high_use
probabilities <- predict(model2, type = "response")
# add the predicted probabilities to pormath
pormath <- mutate(pormath, probability = probabilities)
# use the probabilities to make a prediction of high_use
pormath <- mutate(pormath, prediction = probability > 0.5)
# tabulate the target variable versus the predictions (2x2 cross table)
table(high_use = pormath$high_use, prediction = pormath$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 247 12
## TRUE 88 23
# initialize a plot of 'high_use' versus 'probability' in 'pormath'
g <- ggplot(pormath, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions, as proportions with margin sums
table(high_use = pormath$high_use, prediction = pormath$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66756757 0.03243243 0.70000000
## TRUE 0.23783784 0.06216216 0.30000000
## Sum 0.90540541 0.09459459 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func with prob = 0, i.e. the baseline that predicts low use for everyone
loss_func(class = pormath$high_use, prob = 0)
## [1] 0.3
The first 2x2 cross table is a confusion matrix of the predictions versus the actual values of high alcohol use. The plot visualizes the actual values and the predictions. The second cross table presents the same confusion matrix as proportions with margin sums. In the tables the predicted outcomes are the columns and the true outcomes the rows: the diagonal elements show the correct predictions and the off-diagonal elements the incorrect ones.
Calling the loss function with prob = 0 corresponds to predicting low use for everyone; this baseline misclassifies 0.3 (30%) of the observations, which is simply the share of high users. From the confusion matrix, the fitted model itself misclassifies (12 + 88)/370 ≈ 27% of the observations.
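The mechanics of the loss function can be seen on toy inputs: with the 0.5 threshold, a prediction counts as wrong whenever the probability falls on the wrong side of the true class.

```r
# mean prediction error: proportion of predictions more than 0.5 away
# from the true class (0/1 or logical)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# true classes 1, 1, 0, 0; the 2nd and 4th predictions are wrong
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.2, 0.1, 0.6))
```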
We’re using the Boston dataset from the MASS package, which ships with R; the details of the data source can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
The Boston dataset is a dataframe with 506 observations (rows) and 14 variables (columns). All the variables have numerical values.
Meaning of the variable abbreviations:
crim = per capita crime rate by town
zn = proportion of residential land zoned for lots over 25,000 sq.ft
indus = proportion of non-retail business acres per town
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox = nitrogen oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted mean of distances to five Boston employment centres
rad = index of accessibility to radial highways
tax = full-value property-tax rate per $10,000
ptratio = pupil-teacher ratio by town
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat = lower status of the population (percent)
medv = median value of owner-occupied homes in $1000s
## RStudio Exercise 4: Analysis
date()
## [1] "Sun Dec 5 19:15:29 2021"
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# read in data
data("Boston")
## explore the dataset
# structure
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# dimensions (rows and columns of the data)
dim(Boston)
## [1] 506 14
Let’s print the summary statistics and a plot matrix of the Boston data.
# summary statistics
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot matrix of the variables
pairs(Boston)
Let’s draw a correlation matrix and a correlation plot.
# access libraries
library(MASS)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Correlations between the variables can be computed with the cor() function, which returns the correlation matrix of pairwise correlation coefficients. The corrplot() function visualizes this matrix: the stronger the red color, the stronger the negative correlation between the variables, and the stronger the blue color, the stronger the positive correlation. The plot shows that the strongest negative correlations are between the indus-dis, nox-dis, age-dis and lstat-medv variable pairs, and the strongest positive correlations between the indus-nox, indus-tax, nox-age, rm-medv and rad-tax pairs.
Here we standardize the data with the scale() function, which subtracts the column means from the corresponding columns and divides the differences by the column standard deviations. Scaling makes sense here because the variables in the original Boston data are measured on very different scales; after scaling, each column of the numeric matrix is centered at zero with unit standard deviation.
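A quick check of what scale() computes, on a made-up vector rather than the Boston data:

```r
# scale() performs (x - mean(x)) / sd(x) column-wise
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.vector(scale(x)))  # TRUE
```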
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Above, a categorical crime rate variable was created from the scaled crime rate using the quantiles as break points, and the old continuous crime rate variable was dropped from the dataset. The dataset was then divided into train and test sets so that 80% of the observations belong to the train set. The model is trained on the train set, and its performance on new data is assessed with the test set.
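The same quantile-binning idea can be sketched on a small made-up vector, where each quartile should receive an equal share of the observations:

```r
# cut a toy vector into quartile-based categories
x <- c(1, 3, 5, 7, 9, 11, 13, 15)
bins <- quantile(x)
crime_toy <- cut(x, breaks = bins, include.lowest = TRUE,
                 labels = c("low", "med_low", "med_high", "high"))
table(crime_toy)  # two observations per category
```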
The MASS package provides the lda() function for linear discriminant analysis (LDA). LDA assumes that the variables are normally distributed and that the normal distributions of each class share the same covariance matrix; we scaled the data above with these assumptions in mind. LDA is a classification method: it computes, for a new observation, the probability of belonging to each class and assigns the observation to the class with the highest probability. It does this by finding the linear combinations of the variables that best separate the target variable classes.
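As a minimal sanity check of the lda() workflow, here is the same fit-and-predict pattern on the built-in iris data (not our crime data); resubstitution accuracy on iris is known to be high:

```r
library(MASS)
# fit LDA with Species as the target and all other variables as predictors
fit <- lda(Species ~ ., data = iris)
# predict back on the training data and measure accuracy
pred <- predict(fit, iris)$class
mean(pred == iris$Species)  # about 0.98
```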
Let’s fit the LDA on the train set and draw the LDA plot, the categorical crime rate being the target variable and all the other variables in the dataset as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2549505 0.2376238 0.2475248
##
## Group means:
## zn indus chas nox rm age
## low 0.9605778 -0.8937507 -0.122344297 -0.8669472 0.4536290 -0.8556430
## med_low -0.1169566 -0.2355425 0.033465125 -0.5457455 -0.1384331 -0.3229070
## med_high -0.3782608 0.2105687 0.260819923 0.4530886 0.0972569 0.4205726
## high -0.4872402 1.0149946 0.003267949 1.0344956 -0.4218576 0.8280654
## dis rad tax ptratio black lstat
## low 0.8700574 -0.6953013 -0.7454004 -0.43608720 0.38296388 -0.78475947
## med_low 0.3020064 -0.5615100 -0.4755616 -0.01937294 0.31861497 -0.10953406
## med_high -0.4090673 -0.4160119 -0.3041965 -0.38699600 0.07178924 0.04524294
## high -0.8467313 1.6596029 1.5294129 0.80577843 -0.70432364 0.86680905
## medv
## low 0.521339400
## med_low -0.004939346
## med_high 0.188295902
## high -0.625719386
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09538841 0.754198583 -0.90490874
## indus 0.05404608 -0.178930356 0.29721872
## chas -0.05255782 -0.060338416 0.14639998
## nox 0.21087862 -0.832583555 -1.32120635
## rm -0.14704053 -0.097316794 -0.17496221
## age 0.25280457 -0.188904941 -0.19603335
## dis -0.08380742 -0.304821537 0.09226649
## rad 3.73507325 0.949650630 -0.11279153
## tax 0.04005150 -0.075937806 0.50489433
## ptratio 0.11296447 0.108578993 -0.14085908
## black -0.14092030 0.008421991 0.09977029
## lstat 0.18787340 -0.393768145 0.31921122
## medv 0.18885411 -0.419071960 -0.17299000
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9577 0.0315 0.0108
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Let’s predict the categories with the LDA model on the test data. First, the crime categories from the test set are saved and the categorical crime variable is removed from the test dataset.
# save the correct crime categories from test dataset
correct_classes <- test$crime
# remove the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
# predict the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results with the crime categories from the test set
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 10 0 0
## med_low 4 18 1 0
## med_high 1 14 13 2
## high 0 0 1 26
The prediction results with the crime categories from the test set are shown in the cross table above. In the table the predicted outcomes are the columns and the true outcomes the rows. The diagonal elements show the correct predictions and the off-diagonal elements the incorrect ones. From the cross table we can calculate that 67.6% (69/102) of the predictions are correct.
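The proportion of correct predictions can be recomputed directly from the table: sum the diagonal and divide by the total count. A quick sketch with the counts from the cross table above:

```r
# confusion matrix: rows = correct classes, columns = predicted classes
cm <- matrix(c(12, 10,  0,  0,
                4, 18,  1,  0,
                1, 14, 13,  2,
                0,  0,  1, 26), nrow = 4, byrow = TRUE)
sum(diag(cm)) / sum(cm)  # 69/102, about 0.676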
Distances can be measured with base R’s dist() function, which creates a distance matrix stored as a dist object; the distance matrix is typically a square matrix containing the pairwise distances between the observations. Similarity (nearness) is determined using a distance metric, a function that measures how far two records are from one another.
To measure the Euclidean distance between two vectors, one is subtracted from the other, the differences are squared and summed, and the square root of the sum is taken; it corresponds to the straight-line distance between two points. Another common metric for numeric data is the Manhattan distance, the distance between two points traversed along one axis at a time. Different distance measures produce different output.
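The two metrics can be checked by hand on a pair of made-up points:

```r
# Euclidean vs Manhattan distance between (0,0) and (3,4)
p <- c(0, 0); q <- c(3, 4)
sqrt(sum((p - q)^2))  # Euclidean: 5
sum(abs(p - q))       # Manhattan: 7
# dist() gives the same answers
dist(rbind(p, q))                        # 5
dist(rbind(p, q), method = "manhattan")  # 7
```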
Clustering is a technique to divide data into different groups, where the records on each group are similar to one another. The goal of clustering is to identify significant and meaningful groups of data. K-means divides the data into K clusters by minimizing the sum of the squared distances of each record to the mean of its assigned cluster. K-means does not ensure the clusters will have the same size, but finds the clusters that are best separated. It’s typical to standardize continuous variables, otherwise, variables with large scale will dominate the clustering process.
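A minimal k-means sketch on well-separated toy data (hypothetical points, not the Boston data) illustrates the idea; with this much separation the algorithm very likely recovers the two planted groups:

```r
# two planted groups of 10 points each, centered at 0 and 5
set.seed(42)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
km <- kmeans(pts, centers = 2)
km$size          # cluster sizes (very likely 10 and 10)
km$tot.withinss  # total within-cluster sum of squares being minimized
```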
Let’s standardize the Boston dataset and calculate the distances.
# Reload the Boston dataset and standardize the dataset
# load MASS and Boston
library(MASS)
data('Boston')
# standardize the dataset
boston_standardized <- scale(Boston)
## calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_standardized)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_standardized, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Let’s run k-means algorithm on the Boston dataset. K-means needs the number of clusters as an argument, let’s choose 3. Let’s also visualize the clusters with pairs() function.
# k-means clustering
km <- kmeans(boston_standardized, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_standardized, col = km$cluster)
The clusters seem to somewhat overlap in the plot. To find out the optimal number of clusters, let’s calculate the total of within cluster sum of squares (WCSS) and plot the number of clusters and the total WCSS. The optimal number of clusters is when the total WCSS drops radically.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standardized, k)$tot.withinss})
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
We can see from the plot that the total WCSS drops radically when the number of clusters is 2. Let’s visualize the two clusters with the pairs() function.
# k-means clustering
km <-kmeans(boston_standardized, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_standardized, col = km$cluster)
After repeating the clustering with the optimal number of clusters, 2 differentiated clusters (in black and red color) are now shown in the plot.
The original data is from the United Nations Development Programme and information of the data can be found via these links: http://hdr.undp.org/en/content/human-development-index-hdi http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf https://raw.githubusercontent.com/TuomoNieminen/Helsinki-Open-Data-Science/master/datasets/human_meta.txt
The selected variable names have been shortened and two new variables have been computed. Observations with missing values have been removed, observations referring to regions instead of countries have been removed, the original Country variable has been removed, and the country names have been set as the row names of the dataframe. The dataframe now has 8 variables (columns) and 155 observations (rows).
Meaning of the variable abbreviations:
ratio_sec_edu = ratio: proportion of females with at least secondary education / proportion of males with at least secondary education
ratio_labour = ratio: proportion of females in the labour force / proportion of males in the labour force
life_exp_birth = life expectancy at birth
exp_years_edu = expected years of schooling
GNI_per_capita = Gross National Income per capita
mat_mort_ratio = maternal mortality ratio
adol_birth_rate = adolescent birth rate
repr_parl = percentage of female representatives in parliament
## RStudio Exercise 5: Analysis
date()
## [1] "Sun Dec 5 19:15:43 2021"
# read in human data
human <- read.table('~/IODS-project/data/human.csv')
# first six observations of the human data
head(human)
## ratio_sec_edu ratio_labour life_exp_birth exp_years_edu
## Norway 1.0072389 0.8908297 81.6 17.5
## Australia 0.9968288 0.8189415 82.4 20.2
## Switzerland 0.9834369 0.8251001 83.0 15.8
## Denmark 0.9886128 0.8840361 80.2 18.7
## Netherlands 0.9690608 0.8286119 81.6 17.9
## Germany 0.9927835 0.8072289 80.9 16.5
## GNI_per_capita mat_mort_ratio adol_birth_rate repr_parl
## Norway 64992 4 7.8 39.6
## Australia 42261 6 12.1 30.5
## Switzerland 56431 6 1.9 28.5
## Denmark 44025 5 5.1 38.0
## Netherlands 45435 6 6.2 36.9
## Germany 43919 7 3.8 36.9
# structure of the human data
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ ratio_sec_edu : num 1.007 0.997 0.983 0.989 0.969 ...
## $ ratio_labour : num 0.891 0.819 0.825 0.884 0.829 ...
## $ life_exp_birth : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ exp_years_edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI_per_capita : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ mat_mort_ratio : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adol_birth_rate: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ repr_parl : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# basic statistics of the human data
summary(human)
## ratio_sec_edu ratio_labour life_exp_birth exp_years_edu
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI_per_capita mat_mort_ratio adol_birth_rate repr_parl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Let’s visualize the data with the ggpairs() and corrplot() functions.
# access GGally
library(GGally)
# access dplyr
library(dplyr)
# access corrplot
library(corrplot)
# visualize the human data variables
ggpairs(human)
The plot matrix from ggpairs() shows scatter plots of each pair of variables in the lower triangle, Pearson correlations in the upper triangle, and the distribution of each variable on the diagonal. The strongest positive correlations are between the exp_years_edu - life_exp_birth and adol_birth_rate - mat_mort_ratio variable pairs, and the strongest negative correlations between the mat_mort_ratio - life_exp_birth and mat_mort_ratio - exp_years_edu pairs. Looking at the diagonal, the GNI_per_capita, mat_mort_ratio, adol_birth_rate and repr_parl variables have right-skewed distributions, the ratio_sec_edu, ratio_labour and life_exp_birth variables have left-skewed distributions, and the exp_years_edu variable is approximately normally distributed.
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot
To study linear relationships, correlations can also be computed with the cor() function and then visualized with the corrplot() function from the corrplot package. The variables run diagonally from top left to bottom right, and each variable is plotted against every other; the circles in the upper right are mirror images of those in the lower left. The stronger the red color, the stronger the negative correlation, and the stronger the blue color, the stronger the positive correlation between the variables. The plot shows the same correlations as the ggpairs() matrix above, but the differences are easier to distinguish in this visualization.
Correlation does not imply causation. Nevertheless, countries with higher life expectancy at birth tend to have higher expected years of schooling, and a higher adolescent birth rate goes together with a higher maternal mortality ratio. It also seems that as maternal mortality increases, both life expectancy at birth and expected years of schooling decrease.
Principal component analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original set; it is a technique for discovering how numeric variables covary. Singular value decomposition (SVD) is the most important tool for reducing the number of dimensions in multivariate data. The first principal component (PC1) captures the maximum amount of variance of the features in the original data. The second principal component (PC2) is orthogonal to the first and captures the maximum amount of the remaining variability, and the same holds for each subsequent component. The components are all uncorrelated, and each one captures less variance than the previous one.
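The link between prcomp() and the SVD can be sketched on toy data: the standard deviations reported by prcomp() are the singular values of the centered data matrix divided by the square root of n - 1.

```r
# toy check that prcomp() amounts to an SVD of the centered data
set.seed(1)
X <- matrix(rnorm(40), nrow = 10)
p <- prcomp(X)  # centers by default, no scaling
s <- svd(scale(X, center = TRUE, scale = FALSE))
all.equal(p$sdev, s$d / sqrt(nrow(X) - 1))  # TRUE
```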
Let’s perform PCA on the unstandardized human data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# create and print out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Biplot displays the observations by the first two principal components (PCs), PC1 coordinate in x-axis, PC2 coordinate in y-axis. Arrows represent the original variables.
PCA is sensitive to the variances of the initial variables: if their ranges differ greatly, the variables with larger ranges dominate those with smaller ranges, which leads to biased results. Here PC1 appears to capture essentially all of the variance because the PCA was performed on unstandardized data, in which GNI_per_capita has by far the largest scale; the result is dominated by that single variable.
The point of the data standardization is to standardize the range of the continuous initial variables so that each of them contributes equally to the analysis. By transforming the data to comparable scales with standardization prior to the PCA we can prevent biased results.
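This effect is easy to reproduce on made-up data (not the human data): one large-scale column swallows the variance in the unscaled PCA, while the scaled PCA splits it more evenly.

```r
# one unit-scale column and one column a thousand times larger
set.seed(3)
d <- cbind(a = rnorm(100), b = 1000 * rnorm(100))
# proportion of variance per PC, unscaled: PC1 close to 1
round(summary(prcomp(d))$importance[2, ], 3)
# after standardization the shares are roughly 0.5 / 0.5
round(summary(prcomp(scale(d)))$importance[2, ], 3)
```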
Let’s standardize the data and then repeat the PCA.
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## ratio_sec_edu ratio_labour life_exp_birth exp_years_edu
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI_per_capita mat_mort_ratio adol_birth_rate repr_parl
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human2 <- prcomp(human_std)
# create and print out a summary of pca_human
s2 <- summary(pca_human2)
s2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# rounded percentages of variance captured by each PC
pca_pr2 <- round(100*s2$importance[2,], digits = 1)
# print out the percentages of variance
pca_pr2
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
# draw a biplot
biplot(pca_human2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])
After standardizing the data and rerunning the PCA, the variance is spread across several components. In the biplot the dimensionality of the human data is reduced to two PCs: PC1 captures 53.6% and PC2 16.2% of the total variance in the original 8 variables. The observations are placed on the x and y coordinates defined by the two PCs, and the arrows visualize the relationships between the original variables and the PCs.
The angle between two arrows can be interpreted as the correlation between the corresponding variables, and the angle between an arrow and a PC axis as the correlation between the variable and that component (a small angle means a high positive correlation). The length of an arrow is proportional to the standard deviation of the variable.
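The arrow-angle interpretation can be verified numerically on toy data: with standardized variables, the correlation between a variable and each PC equals the corresponding loading times that component's standard deviation.

```r
# toy check of the loading-correlation identity behind the biplot arrows
set.seed(2)
X <- scale(matrix(rnorm(60), ncol = 3))
p <- prcomp(X)
all.equal(as.vector(cor(X[, 1], p$x)),
          unname(p$rotation[1, ] * p$sdev))  # TRUE
```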
The biplot shows that expected years of schooling, GNI per capita, life expectancy at birth and the secondary education ratio correlate with each other and point in the same direction as PC1, so they contribute to the same dimension. The maternal mortality ratio and adolescent birth rate variables correlate with each other and point in the direction of PC2, contributing to that dimension.
Let’s explore the tea data from the FactoMineR package.
library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)
# read in tea data
data(tea)
## explore the data
# structure of the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# dimensions of the data (rows and columns)
dim(tea)
## [1] 300 36
Let’s choose six variables for a new dataset, tea_time.
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# structure of the data
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# dimensions of the data (rows and columns)
dim(tea_time)
## [1] 300 6
Let’s visualize the tea_time dataset.
# visualize the tea_time dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The tea_time data frame has 6 variables (columns) and 300 observations (rows). The bar plots show the distribution (the frequency of each questionnaire answer) of each factor (categorical) variable. The data concern a questionnaire on tea.
Multiple correspondence analysis (MCA) is also a dimensionality reduction method: it analyses the pattern of relationships between several categorical variables. It is a method for qualitative data and an extension of correspondence analysis (CA) to more than two variables. MCA can be used both to detect patterns or structure in the data and to reduce its dimensionality.
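Under the hood, MCA operates on the complete disjunctive (indicator) table of the categorical variables, in which every factor level becomes a 0/1 dummy column. A minimal base-R sketch (with two made-up factors whose levels mimic the sugar and lunch variables of the tea data) shows what that expansion looks like:

```r
# two hypothetical factors, three observations
df <- data.frame(
  sugar = factor(c("sugar", "No.sugar", "No.sugar")),
  lunch = factor(c("lunch", "Not.lunch", "lunch"))
)
# expand every level of every factor into its own 0/1 column;
# contrasts.arg with contrasts = FALSE keeps all levels (no reference level dropped)
indicator <- model.matrix(~ . - 1, data = df,
                          contrasts.arg = lapply(df, contrasts, contrasts = FALSE))
indicator
```

Each row of the indicator table sums to the number of variables (here 2), since every observation carries exactly one level per factor; MCA is essentially a CA of this table.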
Let’s perform MCA with the MCA() function on the tea_time data.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
The MCA summary contains eigenvalues, individuals, categories and categorical variables. The eigenvalues summary gives the variance and the percentage of variance retained by each dimension. The individuals summary gives each individual's coordinates, contribution (%) to each dimension and cos2 (squared correlation) with each dimension. The categories summary gives the category coordinates, contributions (%), cos2 values and v.test values. The categorical variables summary gives the squared correlation (eta2) between each variable and each dimension.
The MCA biplot shows the variable categories on the first two dimensions; the distance between categories gives a measure of their similarity. Colors represent the different variables. The biplot helps to identify the categories most correlated with each dimension: for example, the answers tea shop and unpackaged are the most correlated with dimension 1, and the answer other is the most correlated with dimension 2. Dimension 1 captures 15.24% and dimension 2 captures 14.23% of the total variance in these six variables.
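The percentages quoted above need not be read off the printed summary: the FactoMineR result object stores them in named components. A quick sketch (refitting the same model so the chunk is self-contained):

```r
library(FactoMineR)

# refit the MCA on the same six tea variables as above
data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# eigenvalues: variance, % of variance and cumulative % per dimension
head(mca$eig)
# category contributions (%) to each dimension
round(mca$var$contrib, 2)
# squared correlations (eta2) between each variable and the dimensions
round(mca$var$eta2, 3)
```

Accessing `mca$eig` and `mca$var` directly is handy when you want to tabulate or plot the dimension contributions yourself instead of relying on `summary()`.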
(more chapters to be added similarly as we proceed with the course!)